{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyNqlLD/LEX3MZ6Uw13WCE8x",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/sugarforever/wtf-langchain/blob/main/10_Example/10_Example.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 一个完整的例子\n",
        "\n",
        "这是该 `LangChain` 极简入门系列的最后一讲。我们将利用过去9讲学习的知识，来完成一个具备完整功能集的LLM应用。该应用基于 `LangChain` 框架，以某 `PDF` 文件的内容为知识库，提供给用户基于该文件内容的问答能力。\n",
        "\n",
        "我们利用 `LangChain` 的QA chain，结合 `Chroma` 来实现PDF文档的语义化搜索。示例代码所引用的是[AWS Serverless\n",
        "Developer Guide](https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf)，该PDF文档共84页。"
      ],
      "metadata": {
        "id": "69PRFT6WO-oK"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "1. 安装必要的 `Python` 包"
      ],
      "metadata": {
        "id": "OBehQYkOPPWe"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -q langchain==0.0.235 openai chromadb pymupdf tiktoken"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_amYPxT-PULc",
        "outputId": "d3d7515d-b214-4140-b39b-a9a209cf00b9"
      },
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\u001b[?25l     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.3 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K     \u001b[91m━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.2/1.3 MB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.7/1.3 MB\u001b[0m \u001b[31m10.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K     \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m1.2/1.3 MB\u001b[0m \u001b[31m11.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m10.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.6/73.6 kB\u001b[0m \u001b[31m8.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m405.5/405.5 kB\u001b[0m \u001b[31m13.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.1/14.1 MB\u001b[0m \u001b[31m44.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m57.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m90.0/90.0 kB\u001b[0m \u001b[31m12.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m59.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
            "  Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
            "  Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.4/58.4 kB\u001b[0m \u001b[31m6.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m59.5/59.5 kB\u001b[0m \u001b[31m7.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.3/5.3 MB\u001b[0m \u001b[31m74.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m84.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m103.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.3/67.3 kB\u001b[0m \u001b[31m8.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
            "  Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
            "  Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.4/49.4 kB\u001b[0m \u001b[31m5.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.0/67.0 kB\u001b[0m \u001b[31m7.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m428.8/428.8 kB\u001b[0m \u001b[31m36.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.1/4.1 MB\u001b[0m \u001b[31m116.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m78.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m129.9/129.9 kB\u001b[0m \u001b[31m15.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m11.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Building wheel for chroma-hnswlib (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "  Building wheel for pypika (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "2. 设置OpenAI环境"
      ],
      "metadata": {
        "id": "8Hihrnw_PeIA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "os.environ['OPENAI_API_KEY'] = ''"
      ],
      "metadata": {
        "id": "dALQoneUPgEH"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "3. 下载PDF文件AWS Serverless Developer Guide"
      ],
      "metadata": {
        "id": "8aB0OBRFP5FC"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf\n",
        "\n",
        "PDF_NAME = 'serverless-core.pdf'"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zF-PFO9BP6wr",
        "outputId": "1d761def-1df0-4043-f00e-be6dd913f1b2"
      },
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2023-08-17 11:42:20--  https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf\n",
            "Resolving docs.aws.amazon.com (docs.aws.amazon.com)... 108.159.227.88, 108.159.227.51, 108.159.227.3, ...\n",
            "Connecting to docs.aws.amazon.com (docs.aws.amazon.com)|108.159.227.88|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 4727395 (4.5M) [application/pdf]\n",
            "Saving to: ‘serverless-core.pdf’\n",
            "\n",
            "serverless-core.pdf 100%[===================>]   4.51M  12.0MB/s    in 0.4s    \n",
            "\n",
            "2023-08-17 11:42:21 (12.0 MB/s) - ‘serverless-core.pdf’ saved [4727395/4727395]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "4. 加载PDF文件"
      ],
      "metadata": {
        "id": "WqBDCt0HQFAA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.document_loaders import PyMuPDFLoader\n",
        "docs = PyMuPDFLoader(PDF_NAME).load()\n",
        "\n",
        "print (f'There are {len(docs)} document(s) in {PDF_NAME}.')\n",
        "print (f'There are {len(docs[0].page_content)} characters in the first page of your document.')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "bniPzdhUQSlw",
        "outputId": "01342468-bef4-4e9f-9b56-1ab6eac65ab4"
      },
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "There are 84 document(s) in serverless-core.pdf.\n",
            "There are 27 characters in the first page of your document.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "5. 拆分文档并存储文本嵌入的向量数据"
      ],
      "metadata": {
        "id": "V9kvXY9uQ1mI"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.embeddings.openai import OpenAIEmbeddings\n",
        "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
        "from langchain.vectorstores import Chroma\n",
        "\n",
        "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
        "split_docs = text_splitter.split_documents(docs)\n",
        "\n",
        "embeddings = OpenAIEmbeddings()\n",
        "\n",
        "vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name=\"serverless_guide\")"
      ],
      "metadata": {
        "id": "G4d8cwQTQ2fa"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "6. 基于OpenAI创建QA链"
      ],
      "metadata": {
        "id": "-T6_mIR8RwEF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.llms import OpenAI\n",
        "from langchain.chains.question_answering import load_qa_chain\n",
        "\n",
        "llm = OpenAI(temperature=0)\n",
        "chain = load_qa_chain(llm, chain_type=\"stuff\")"
      ],
      "metadata": {
        "id": "BsW99LnUR2Ns"
      },
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "7. 基于提问，进行相似性查询"
      ],
      "metadata": {
        "id": "ED54hPgfSXYL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "query = \"What is the use case of AWS Serverless?\"\n",
        "similar_docs = vectorstore.similarity_search(query, 3, include_metadata=True)"
      ],
      "metadata": {
        "id": "bPmKM4zXSam9"
      },
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "similar_docs"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DNog2ekVSxPa",
        "outputId": "78bf8f0c-bdf2-4b9c-8a08-433c7706b758"
      },
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[Document(page_content='Serverless\\nDeveloper Guide', metadata={'author': 'AWS', 'creationDate': 'D:20230817052259Z', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'file_path': 'serverless-core.pdf', 'format': 'PDF 1.4', 'keywords': 'Serverless, serverless guide, getting started serverless, event-driven architecture, Lambda, API Gateway, DynamoDB, serverless, developer, guide, learn serverless, serverless, use-case, serverless, prerequisites, serverless, serverless, fundamentals, even-driven, architecture, serverless, fundamentals, serverless, developer_experience, lifecycle, deploy, packaging, serverless, hands-on, tutorial, workshop, next steps, security, serverless, compute, api, gateway, serverless, database, nosql', 'modDate': '', 'page': 0, 'producer': 'Apache FOP Version 2.6', 'source': 'serverless-core.pdf', 'subject': '', 'title': 'Serverless - Developer Guide', 'total_pages': 84, 'trapped': ''}),\n",
              " Document(page_content='needed to build serverless solutions.\\nIn serverless solutions, you focus on writing code that serves your customers, without managing servers. \\nServerless technologies are pay-as-you-go, can scale both up and down, and are easy to expand across \\ngeographic regions.\\nMeeting you where you are now\\nWe expect that you are a developer with some traditional web application development experience, \\nbut you are new to Amazon Web Services or serverless architectures. We also assume you want to get \\n1', metadata={'author': 'AWS', 'creationDate': 'D:20230817052259Z', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'file_path': 'serverless-core.pdf', 'format': 'PDF 1.4', 'keywords': 'Serverless, serverless guide, getting started serverless, event-driven architecture, Lambda, API Gateway, DynamoDB, serverless, developer, guide, learn serverless, serverless, use-case, serverless, prerequisites, serverless, serverless, fundamentals, even-driven, architecture, serverless, fundamentals, serverless, developer_experience, lifecycle, deploy, packaging, serverless, hands-on, tutorial, workshop, next steps, security, serverless, compute, api, gateway, serverless, database, nosql', 'modDate': '', 'page': 4, 'producer': 'Apache FOP Version 2.6', 'source': 'serverless-core.pdf', 'subject': '', 'title': 'Serverless - Developer Guide', 'total_pages': 84, 'trapped': ''}),\n",
              " Document(page_content='infrastructure (regions, ARNs, security model). Then, it will introduce the shift in mindset you need to \\nmake to start your serverless journey. Next, it will dive into event-driven architecture and other serverless \\nconcepts necessary to transition to serverless.\\nAlong the way, the guide will provide a list of curated resources, such as articles, workshops, and \\ntutorials, to reinforce your learning with hands-on activities.\\nThe guide will focus on common serverless use cases, such as:\\n• Interactive Web- and API-based microservices or applications\\n• Data processing applications\\n• Real-time streaming applications\\n• Machine learning\\n• IT automation and service orchestration\\nTo avoid information overload, scope will be limited to a few essential serverless services to get started:\\n• AWS Identity and Access Management for service security\\n• AWS Lambda for compute functions\\n• Amazon API Gateway for integrating HTTP/S requests with other services that handle the requests', metadata={'author': 'AWS', 'creationDate': 'D:20230817052259Z', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'file_path': 'serverless-core.pdf', 'format': 'PDF 1.4', 'keywords': 'Serverless, serverless guide, getting started serverless, event-driven architecture, Lambda, API Gateway, DynamoDB, serverless, developer, guide, learn serverless, serverless, use-case, serverless, prerequisites, serverless, serverless, fundamentals, even-driven, architecture, serverless, fundamentals, serverless, developer_experience, lifecycle, deploy, packaging, serverless, hands-on, tutorial, workshop, next steps, security, serverless, compute, api, gateway, serverless, database, nosql', 'modDate': '', 'page': 5, 'producer': 'Apache FOP Version 2.6', 'source': 'serverless-core.pdf', 'subject': '', 'title': 'Serverless - Developer Guide', 'total_pages': 84, 'trapped': ''})]"
            ]
          },
          "metadata": {},
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "8. 基于相关文档，利用QA链完成回答"
      ],
      "metadata": {
        "id": "1XecjykTSnve"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "chain.run(input_documents=similar_docs, question=query)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        },
        "id": "E4YOeM8aSuEY",
        "outputId": "14ccdea1-f586-45cf-bf4c-cce99322739c"
      },
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "' AWS Serverless can be used for interactive web- and API-based microservices or applications, data processing applications, real-time streaming applications, machine learning, and IT automation and service orchestration.'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 9
        }
      ]
    }
  ]
}